Sunah Park, 2. Jan 2019

This markdown file was created by Sunah Park for the extended lab exercises in the book An Introduction to Statistical Learning with Applications in R (ISLR).


Setup for code chunks


library(MASS)
library(ISLR)
library(ggplot2)
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
head(Boston,3)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
Boston<-na.omit(Boston)
sum(is.na(Boston$medv))
## [1] 0
names(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"

1. Simple Linear Regression

The lm() function is used to fit a simple linear regression model.

lm.fit<-lm(medv~lstat, data=Boston) # y: medv, x: lstat from Boston database
summary(lm.fit) # reports Std. Error, t value, p-value, R-squared, F-statistic
## 
## Call:
## lm(formula = medv ~ lstat, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.168  -3.990  -1.318   2.034  24.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
## lstat       -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432 
## F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16
names(lm.fit) # Information stored in the linear model object
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"
coef(lm.fit) # Coefficients of linear model
## (Intercept)       lstat 
##  34.5538409  -0.9500494
confint(lm.fit) # Confidence interval for the coefficient estimates
##                 2.5 %     97.5 %
## (Intercept) 33.448457 35.6592247
## lstat       -1.026148 -0.8739505
pred.lm.fit<-predict(lm.fit, data.frame(lstat=seq(1,50,5)),interval="confidence"); pred.lm.fit
##           fit        lwr       upr
## 1  33.6037915  32.564024 34.643559
## 2  28.8535448  28.111211 29.595878
## 3  24.1032980  23.546025 24.660571
## 4  19.3530512  18.753385 19.952718
## 5  14.6028045  13.767222 15.438387
## 6   9.8525577   8.700886 11.004229
## 7   5.1023109   3.604296  6.600326
## 8   0.3520641  -1.505704  2.209832
## 9  -4.3981826  -6.622617 -2.173748
## 10 -9.1484294 -11.743514 -6.553345
pred.lm.fit2<-predict(lm.fit, data.frame(lstat=seq(1,50,5)),interval="prediction"); pred.lm.fit2
##           fit        lwr       upr
## 1  33.6037915  21.347614 45.859969
## 2  28.8535448  16.619011 41.088079
## 3  24.1032980  11.878597 36.327999
## 4  19.3530512   7.126344 31.579758
## 5  14.6028045   2.362259 26.843350
## 6   9.8525577  -2.413620 22.118735
## 7   5.1023109  -7.201218 17.405839
## 8   0.3520641 -12.000428 12.704556
## 9  -4.3981826 -16.811114  8.014749
## 10 -9.1484294 -21.633109  3.336250
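The "confidence" intervals above bound the mean response at a given lstat, while the wider "prediction" intervals bound an individual observation. The fitted values themselves are just a linear combination of the estimated coefficients; a small sketch verifying this against predict() (the names `b`, `manual`, and `via_predict` are illustrative):

```r
library(MASS)  # provides the Boston data set

lm.fit <- lm(medv ~ lstat, data = Boston)
b <- coef(lm.fit)

# Fitted value at lstat = 6, computed by hand from the coefficients
manual <- unname(b[1] + b[2] * 6)
# The same value via predict()
via_predict <- unname(predict(lm.fit, newdata = data.frame(lstat = 6)))

all.equal(manual, via_predict)  # TRUE; matches row 2 of the tables above
```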

The fitted model for simple linear regression: \[ {medv}=34.55-0.95\times{lstat} \]

ggplot(data=Boston, aes(x=lstat, y=medv))+geom_jitter(alpha=.7, color="grey")+stat_smooth(method="lm", level=0.95)+theme_bw(base_family="Avenir")

Simple linear model diagnostics

df<-data.frame(x=Boston$lstat,
               y=Boston$medv,
               y.fit=lm.fit$fitted.values, 
               residuals=lm.fit$residuals, 
               i=seq_along(Boston$lstat))

When fitting a least squares line, we generally require the following conditions.

  1. Linearity. The data should show a linear trend. We look for a random scatter of residuals around 0 (Plot residuals vs. x)
ggplot(df,aes(x=x,y=residuals))+geom_point(alpha=0.3)+geom_hline(yintercept=0, color="blue",linetype="solid",size=0.5)

Diagnosis: there is some evidence of non-linearity.


  2. Nearly normal residuals with mean 0. Generally the residuals must be nearly normal. When this condition is found to be unreasonable, it is usually because of outliers or concerns about influential points (Plot histogram for residuals).
ggplot(data=df, aes(residuals))+geom_histogram(binwidth=1)

ggplot(data=df, aes(sample=residuals))+geom_qq(size=0.5)+geom_qq_line(size=0.5, col="blue")

Diagnosis: the residuals are right-skewed.


  3. Constant variability of residuals. The variability of points around the least squares line remains roughly constant. In other words, residuals should be equally variable for low and high values of the predicted response variable (Plot residuals vs. y.fit)
ggplot(data=df, aes(x=y.fit,y=residuals))+geom_jitter(alpha=.3)+geom_hline(yintercept=0, color="blue", linetype="solid", size=0.5)

Diagnosis: the variability of the residuals is non-constant.


  4. Independent residuals. If time-series structure is suspected, check residuals against the order of data collection (Plot residuals vs. order of data)
ggplot(data=df, aes(x=i, y=residuals))+geom_jitter(alpha=.3)+geom_hline(yintercept=0, color="blue", linetype="solid", size=0.5)

Diagnosis: there is no increasing or decreasing pattern in the residuals; the model does not appear to suffer from time-series structure.
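The four checks above can also be reproduced with base R graphics, which bundle the standard diagnostic plots into a single call (a sketch; `fit` is an illustrative name):

```r
library(MASS)  # provides the Boston data set

fit <- lm(medv ~ lstat, data = Boston)

# Base R draws residuals vs. fitted, normal Q-Q, scale-location,
# and residuals vs. leverage in one 2x2 panel:
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))
```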

2. Multiple Linear Regression

# Model selection
lm.fit<-lm(medv~., data=Boston) # Full model where y: medv, x: all predictors in Boston 
summary(lm.fit) # reports Std. Error, t value, p-value, R-squared, F-statistic
## 
## Call:
## lm(formula = medv ~ ., data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.595  -2.730  -0.518   1.777  26.199 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
## crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
## zn           4.642e-02  1.373e-02   3.382 0.000778 ***
## indus        2.056e-02  6.150e-02   0.334 0.738288    
## chas         2.687e+00  8.616e-01   3.118 0.001925 ** 
## nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
## rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
## age          6.922e-04  1.321e-02   0.052 0.958229    
## dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
## rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
## tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
## ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
## black        9.312e-03  2.686e-03   3.467 0.000573 ***
## lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7338 
## F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16
# Model selection using p-value backward elimination: start with the full model,
# drop the variable with the highest p-value and refit a smaller model, then
# repeat until all variables left in the model are significant (p-value < alpha).
lm.fit<-lm(medv~.-age, data=Boston) # The variable age is removed first, as it has the highest p-value (0.958), above the chosen significance level alpha of 0.05.
summary(lm.fit)
## 
## Call:
## lm(formula = medv ~ . - age, data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.6054  -2.7313  -0.5188   1.7601  26.2243 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  36.436927   5.080119   7.172 2.72e-12 ***
## crim         -0.108006   0.032832  -3.290 0.001075 ** 
## zn            0.046334   0.013613   3.404 0.000719 ***
## indus         0.020562   0.061433   0.335 0.737989    
## chas          2.689026   0.859598   3.128 0.001863 ** 
## nox         -17.713540   3.679308  -4.814 1.97e-06 ***
## rm            3.814394   0.408480   9.338  < 2e-16 ***
## dis          -1.478612   0.190611  -7.757 5.03e-14 ***
## rad           0.305786   0.066089   4.627 4.75e-06 ***
## tax          -0.012329   0.003755  -3.283 0.001099 ** 
## ptratio      -0.952211   0.130294  -7.308 1.10e-12 ***
## black         0.009321   0.002678   3.481 0.000544 ***
## lstat        -0.523852   0.047625 -10.999  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.74 on 493 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7343 
## F-statistic: 117.3 on 12 and 493 DF,  p-value: < 2.2e-16
lm.fit<-lm(medv~.-age-indus, data=Boston) # The variables age and indus are now removed from the model
# Alternatively, the update() function can be used (e.g. update(lm.fit, ~.-age))
summary(lm.fit) # We now have all the variables whose p-value is less than alpha. Our final model has 11 predictors.
## 
## Call:
## lm(formula = medv ~ . - age - indus, data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.5984  -2.7386  -0.5046   1.7273  26.2373 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  36.341145   5.067492   7.171 2.73e-12 ***
## crim         -0.108413   0.032779  -3.307 0.001010 ** 
## zn            0.045845   0.013523   3.390 0.000754 ***
## chas          2.718716   0.854240   3.183 0.001551 ** 
## nox         -17.376023   3.535243  -4.915 1.21e-06 ***
## rm            3.801579   0.406316   9.356  < 2e-16 ***
## dis          -1.492711   0.185731  -8.037 6.84e-15 ***
## rad           0.299608   0.063402   4.726 3.00e-06 ***
## tax          -0.011778   0.003372  -3.493 0.000521 ***
## ptratio      -0.946525   0.129066  -7.334 9.24e-13 ***
## black         0.009291   0.002674   3.475 0.000557 ***
## lstat        -0.522553   0.047424 -11.019  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7348 
## F-statistic: 128.2 on 11 and 494 DF,  p-value: < 2.2e-16
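The manual drops of age and indus can also be automated. A minimal sketch of p-value backward elimination (the loop and the names `fit`, `alpha`, `worst` are illustrative, not from ISLR):

```r
library(MASS)  # provides the Boston data set

fit <- lm(medv ~ ., data = Boston)
alpha <- 0.05

# Repeatedly drop the predictor with the largest p-value until
# every remaining predictor is significant at level alpha.
repeat {
  p <- summary(fit)$coefficients[-1, "Pr(>|t|)"]  # skip the intercept row
  if (max(p) < alpha) break
  worst <- names(which.max(p))
  fit <- update(fit, as.formula(paste("~ . -", worst)))
}
names(coef(fit))  # age and indus have been eliminated
```

Note that the related step() function selects by AIC rather than p-values, so it may retain a different subset of predictors.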

The final equation is: \[ {medv}=36.34-0.11\times{crim}+0.05\times{zn}+2.72\times{chas}-17.38\times{nox}+3.80\times{rm}-1.49\times{dis}+0.30\times{rad}-0.01\times{tax}-0.95\times{ptratio}+0.01\times{black}-0.52\times{lstat} \]

attributes(summary(lm.fit))
## $names
##  [1] "call"          "terms"         "residuals"     "coefficients" 
##  [5] "aliased"       "sigma"         "df"            "r.squared"    
##  [9] "adj.r.squared" "fstatistic"    "cov.unscaled" 
## 
## $class
## [1] "summary.lm"
summary(lm.fit)$adj.r.squared
## [1] 0.7348058

The fitted model has an adjusted R-squared of 0.73. In other words, about 73% of the variability in the response variable medv is explained by the model.
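Adjusted R-squared penalizes R-squared for the number of predictors p relative to the sample size n, via \(1-(1-R^2)\frac{n-1}{n-p-1}\). A quick sketch checking this formula against summary() (the names `fit`, `s`, `adj` are illustrative):

```r
library(MASS)  # provides the Boston data set

fit <- lm(medv ~ . - age - indus, data = Boston)
s <- summary(fit)

n <- nrow(Boston)           # 506 observations
p <- length(coef(fit)) - 1  # 11 predictors (intercept excluded)

# Adjusted R-squared computed by hand from R-squared, n, and p
adj <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)
all.equal(adj, s$adj.r.squared)  # TRUE
```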


Multiple linear model diagnostics

df<-data.frame(y=Boston$medv,
               y.fit=lm.fit$fitted.values, 
               residuals=lm.fit$residuals, 
               abs.residuals=abs(lm.fit$residuals),
               i=seq_along(Boston$lstat))

Multiple regression methods generally depend on the following four assumptions:

  1. Normal probability plot. In a normal probability plot for residuals, we tend to be most worried about residuals that appear to be outliers, since these indicate long tails in the distribution of residuals.
ggplot(data=df, aes(residuals))+geom_histogram(binwidth=1)

ggplot(data=df, aes(sample=residuals))+geom_qq(size=0.5)+geom_qq_line(size=0.5, col="blue")

Diagnosis: the residuals are right-skewed with some outliers, though none are highly extreme.


  2. Absolute values of residuals against fitted values. The plot of absolute residuals vs. fitted values is helpful to check the condition that the variance of the residuals is approximately constant.
ggplot(data=df, aes(x=y.fit,y=abs.residuals))+geom_jitter(alpha=.5)

Diagnosis: three data points have absolute residuals above 20, but they are not extreme; otherwise the variance is roughly constant.


  3. Residuals in order of their data collection. If time-series structure is suspected, check residuals against the order of data collection (Plot residuals vs. order of data)
ggplot(data=df, aes(x=i, y=residuals))+geom_jitter(alpha=.3)+geom_hline(yintercept=0, color="blue", linetype="solid", size=0.5)

Diagnosis: we see no structure that indicates a time-series issue.


In the case of discrete variables:

  4. Residuals against each predictor variable. We check that the variability of the residuals does not differ across groups. Apart from the binary dummy chas, the predictors here are continuous, so we skip this check.
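Should a grouped check ever be wanted here, chas is a 0/1 dummy, so one option is a boxplot of residuals by group (a sketch; the names `fit` and `df2` are illustrative):

```r
library(MASS)     # provides the Boston data set
library(ggplot2)

fit <- lm(medv ~ . - age - indus, data = Boston)
df2 <- data.frame(chas = factor(Boston$chas),
                  residuals = residuals(fit))

# Similar spread in both boxes would support equal variability across groups
ggplot(df2, aes(x = chas, y = residuals)) + geom_boxplot()
```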